1.
Genome Res ; 34(3): 454-468, 2024 Apr 25.
Article in English | MEDLINE | ID: mdl-38627094

ABSTRACT

Reference-free genome phasing is vital for understanding allele inheritance and the impact of single-molecule DNA variation on phenotypes. To achieve thorough phasing across homozygous or repetitive regions of the genome, long-read sequencing technologies are often used to perform phased de novo assembly. As a step toward reducing the cost and complexity of this type of analysis, we describe new methods for accurately phasing Oxford Nanopore Technologies (ONT) sequence data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of ONT PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.


Subject(s)
Nanopores , Humans , Sequence Analysis, DNA/methods , Nanopore Sequencing/methods , High-Throughput Nucleotide Sequencing/methods , Software , Genomics/methods
2.
Nat Methods ; 20(10): 1483-1492, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37710018

ABSTRACT

Long-read sequencing technologies substantially overcome the limitations of short-reads but have not been considered as a feasible replacement for population-scale projects, being a combination of too expensive, not scalable enough or too error-prone. Here we develop an efficient and scalable wet lab and computational protocol, Napu, for Oxford Nanopore Technologies long-read sequencing that seeks to address those limitations. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the National Institutes of Health Center for Alzheimer's and Related Dementias. Using a single PromethION flow cell, we can detect single nucleotide polymorphisms with F1-score comparable to Illumina short-read sequencing. Small indel calling remains difficult within homopolymers and tandem repeats, but achieves good concordance to Illumina indel calls elsewhere. Further, we can discover structural variants with F1-score on par with state-of-the-art de novo assembly methods. Our protocol phases small and structural variants at megabase scales and produces highly accurate, haplotype-specific methylation calls.


Subject(s)
Genome, Human , Nanopore Sequencing , Humans , Sequence Analysis, DNA/methods , Haplotypes , Methylation , Pilot Projects , High-Throughput Nucleotide Sequencing/methods
3.
bioRxiv ; 2023 Feb 22.
Article in English | MEDLINE | ID: mdl-36865218

ABSTRACT

As a step towards simplifying and reducing the cost of haplotype-resolved de novo assembly, we describe new methods for accurately phasing nanopore data with the Shasta genome assembler and a modular tool for extending phasing to the chromosome scale called GFAse. We test using new variants of Oxford Nanopore Technologies' (ONT) PromethION sequencing, including those using proximity ligation, and show that newer, higher accuracy ONT reads substantially improve assembly quality.

4.
bioRxiv ; 2023 Apr 05.
Article in English | MEDLINE | ID: mdl-36711673

ABSTRACT

Long-read sequencing technologies substantially overcome the limitations of short-reads but to date have not been considered as a feasible replacement at scale due to a combination of being too expensive, not scalable enough, or too error-prone. Here, we develop an efficient and scalable wet lab and computational protocol for Oxford Nanopore Technologies (ONT) long-read sequencing that seeks to provide a genuine alternative to short-reads for large-scale genomics projects. We applied our protocol to cell lines and brain tissue samples as part of a pilot project for the NIH Center for Alzheimer's and Related Dementias (CARD). Using a single PromethION flow cell, we can detect SNPs with F1-score better than Illumina short-read sequencing. Small indel calling remains difficult inside homopolymers and tandem repeats, but is comparable to Illumina calls elsewhere. Further, we can discover structural variants with F1-score comparable to state-of-the-art methods involving Pacific Biosciences HiFi sequencing and trio information (but at a lower cost and greater throughput). Using ONT-based phasing, we can then combine and phase small and structural variants at megabase scales. Our protocol also produces highly accurate, haplotype-specific methylation calls. Overall, this makes large-scale long-read sequencing projects feasible; the protocol is currently being used to sequence thousands of brain-based genomes as a part of the NIH CARD initiative. We provide the protocol and software as open-source integrated pipelines for generating phased variant calls and assemblies.

5.
Nature ; 611(7936): 519-531, 2022 Nov.
Article in English | MEDLINE | ID: mdl-36261518

ABSTRACT

The current human reference genome, GRCh38, represents over 20 years of effort to generate a high-quality assembly, which has benefitted society. However, it still has many gaps and errors, and does not represent a biological genome as it is a blend of multiple individuals. Recently, a high-quality telomere-to-telomere reference, CHM13, was generated with the latest long-read technologies, but it was derived from a hydatidiform mole cell line with a nearly homozygous genome. To address these limitations, the Human Pangenome Reference Consortium formed with the goal of creating high-quality, cost-effective, diploid genome assemblies for a pangenome reference that represents human genetic diversity. Here, in our first scientific report, we determined which combination of current genome sequencing and assembly approaches yields the most complete and accurate diploid genome assembly with minimal manual curation. Approaches that used highly accurate long reads and parent-child data with graph-based haplotype phasing during assembly outperformed those that did not. Developing a combination of the top-performing methods, we generated our first high-quality diploid reference assembly, containing only approximately four gaps per chromosome on average, with most chromosomes within ±1% of the length of CHM13. Nearly 48% of protein-coding genes have non-synonymous amino acid changes between haplotypes, and centromeric regions showed the highest diversity. Our findings serve as a foundation for assembling near-complete diploid human genomes at scale for a pangenome reference to capture global genetic variation from single nucleotides to structural rearrangements.


Subject(s)
Chromosome Mapping , Diploidy , Genome, Human , Genomics , Humans , Chromosome Mapping/standards , Genome, Human/genetics , Haplotypes/genetics , High-Throughput Nucleotide Sequencing/methods , High-Throughput Nucleotide Sequencing/standards , Sequence Analysis, DNA/methods , Sequence Analysis, DNA/standards , Reference Standards , Genomics/methods , Genomics/standards , Chromosomes, Human/genetics , Genetic Variation/genetics
6.
Nat Methods ; 18(11): 1322-1332, 2021 Nov.
Article in English | MEDLINE | ID: mdl-34725481

ABSTRACT

Long-read sequencing has the potential to transform variant detection by reaching currently difficult-to-map regions and routinely linking together adjacent variations to enable read-based phasing. Third-generation nanopore sequence data have demonstrated a long read length, but current interpretation methods for their novel pore-based signal have unique error profiles, making accurate analysis challenging. Here, we introduce a haplotype-aware variant calling pipeline, PEPPER-Margin-DeepVariant, that produces state-of-the-art variant calling results with nanopore data. We show that our nanopore-based method outperforms the short-read-based single-nucleotide-variant identification method at the whole-genome scale and produces high-quality single-nucleotide variants in segmental duplications and low-mappability regions where short-read-based genotyping fails. We show that our pipeline can provide highly contiguous phase blocks across the genome with nanopore reads, contiguously spanning between 85% and 92% of annotated genes across six samples. We also extend PEPPER-Margin-DeepVariant to PacBio HiFi data, providing an efficient solution with superior performance over the current WhatsHap-DeepVariant standard. Finally, we demonstrate de novo assembly polishing methods that use nanopore and PacBio HiFi reads to produce diploid assemblies with high accuracy (Q35+ nanopore-polished and Q40+ PacBio HiFi-polished).


Subject(s)
Genes , Haplotypes , High-Throughput Nucleotide Sequencing/methods , Nanopores , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/methods , Software , Genome, Human , Humans , Molecular Sequence Annotation
7.
Nat Biotechnol ; 38(9): 1044-1053, 2020 Sep.
Article in English | MEDLINE | ID: mdl-32686750

ABSTRACT

De novo assembly of a human genome using nanopore long-read sequences has been reported, but it used more than 150,000 CPU hours and weeks of wall-clock time. To enable rapid human genome assembly, we present Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN. Using a single PromethION nanopore sequencer and our toolkit, we assembled 11 highly contiguous human genomes de novo in 9 d. We achieved roughly 63× coverage, 42-kb read N50 values and 6.5× coverage in reads >100 kb using three flow cells per sample. Shasta produced a complete haploid human genome assembly in under 6 h on a single commercial compute node. MarginPolish and HELEN polished haploid assemblies to more than 99.9% identity (Phred quality score QV = 30) with nanopore reads alone. Addition of proximity-ligation sequencing enabled near chromosome-level scaffolds for all 11 genomes. We compare our assembly performance to existing methods for diploid, haploid and trio-binned human samples and report superior accuracy and speed.


Subject(s)
Genome, Human/genetics , High-Throughput Nucleotide Sequencing/methods , Nanopore Sequencing , Sequence Analysis, DNA/methods , Algorithms , Benchmarking , Chromosomes, Human/genetics , Deep Learning , Genomics , HLA Antigens/genetics , Haploidy , High-Throughput Nucleotide Sequencing/standards , Humans , Sequence Analysis, DNA/standards
8.
Genet Med ; 20(5): 495-502, 2018 Apr.
Article in English | MEDLINE | ID: mdl-29758565

ABSTRACT

PURPOSE: We describe a novel syndrome in seven female patients with extreme developmental delay and neoteny. METHODS: All patients in this study were female, aged 4 to 23 years, were well below the fifth percentile in height and weight, had failed to develop sexually, and lacked the use of language. Karyotype and array chromosome genomic hybridization analysis failed to identify large-scale structural variations. To further understand the underlying cause of disease in these patients, whole-genome sequencing was performed. RESULTS: In five patients, coding de novo mutations (DNMs) were found in five different genes. These genes fell into similar functional categories of transcription regulation and chromatin modification. Comparison to a control population suggested that individuals with neotenic complex syndrome (NCS), a name that we propose herein, could have an excess of rare inherited variants in genes associated with developmental delay and autism, although the difference was not significant. CONCLUSION: We describe an extreme form of developmental delay, with the defining characteristic of neoteny. In most patients we identified coding DNMs in a set of genes intolerant of haploinsufficiency; however, it is not clear whether these contributed to NCS. Rare inherited variants may also be associated with NCS, but more samples need to be analyzed to achieve statistical significance.


Subject(s)
Abnormalities, Multiple/diagnosis , Abnormalities, Multiple/genetics , Genetic Association Studies , Genetic Predisposition to Disease , Genetic Testing , Phenotype , Adolescent , Adult , Alleles , Amino Acid Substitution , Child , Child, Preschool , Facies , Female , Gene Frequency , Genetic Testing/methods , Genotype , Humans , Male , Syndrome , Whole Genome Sequencing , Young Adult
9.
Gigascience ; 5(1): 42, 2016 Oct 11.
Article in English | MEDLINE | ID: mdl-27724973

ABSTRACT

BACKGROUND: Since the completion of the Human Genome Project in 2003, it is estimated that more than 200,000 individual whole human genomes have been sequenced, a stunning accomplishment in such a short period of time. However, most of these were sequenced without experimental haplotype data and are therefore missing an important aspect of genome biology. In addition, much of the genomic data is not available to the public and lacks phenotypic information. FINDINGS: As part of the Personal Genome Project, blood samples from 184 participants were collected and processed using Complete Genomics' Long Fragment Read technology. Here, we present the experimental whole genome haplotyping and sequencing of these samples to an average read coverage depth of 100X. This is approximately three-fold higher than the read coverage applied to most whole human genome assemblies and ensures the highest quality results. Currently, 114 genomes from this dataset are freely available in the GigaDB repository and are associated with rich phenotypic data; the remaining 70 should be added in the near future as they are approved through the PGP data release process. For reproducibility analyses, 20 genomes were sequenced at least twice using independent LFR barcoded libraries. Seven genomes were also sequenced using Complete Genomics' standard non-barcoded library process. In addition, we report 2.6 million high-quality, rare variants not previously identified in the Single Nucleotide Polymorphisms database or the 1000 Genomes Project Phase 3 data. CONCLUSIONS: These genomes represent a unique source of haplotype and phenotype data for the scientific community and should help to expand our understanding of human genome evolution and function.


Subject(s)
Genome, Human , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , DNA/blood , Haplotypes , Humans , Reproducibility of Results
10.
Genome Res ; 22(4): 593-601, 2012 Apr.
Article in English | MEDLINE | ID: mdl-22267523

ABSTRACT

Hepatitis B virus (HBV) infection is a leading risk factor for hepatocellular carcinoma (HCC). HBV integration into the host genome has been reported, but its scale, impact and contribution to HCC development are not clear. Here, we sequenced the tumor and nontumor genomes (>80× coverage) and transcriptomes of four HCC patients and identified 255 HBV integration sites. Increased sequencing to 240× coverage revealed a proportionally higher number of integration sites. Clonal expansion of HBV-integrated hepatocytes was found specifically in tumor samples. We observe a diverse collection of genomic perturbations near viral integration sites, including direct gene disruption, viral promoter-driven human transcription, viral-human transcript fusion, and DNA copy number alteration. Thus, we report the most comprehensive characterization of HBV integration in hepatocellular carcinoma patients. Such widespread random viral integration will likely increase carcinogenic opportunities in HBV-infected individuals.


Subject(s)
Carcinoma, Hepatocellular/genetics , Genome, Human/genetics , Hepatitis B virus/genetics , Hepatitis B/genetics , Liver Neoplasms/genetics , Virus Integration/genetics , Base Sequence , Binding Sites/genetics , Carcinoma, Hepatocellular/virology , Female , Gene Expression Profiling/methods , Gene Expression Regulation, Neoplastic , Hepatitis B/virology , Hepatitis B virus/physiology , Host-Pathogen Interactions/genetics , Humans , Liver Neoplasms/virology , Male , Molecular Sequence Data , Mutation , Oligonucleotide Array Sequence Analysis , Sequence Analysis, DNA/methods , Transcriptome/genetics
11.
J Comput Biol ; 19(3): 279-92, 2012 Mar.
Article in English | MEDLINE | ID: mdl-22175250

ABSTRACT

Unchained base reads on self-assembling DNA nanoarrays have recently emerged as a promising approach to low-cost, high-quality resequencing of human genomes. Because of unique characteristics of these mated pair reads, existing computational methods for resequencing assembly, such as those based on map-consensus calling, are not adequate for accurate variant calling. We describe novel computational methods developed for accurate calling of SNPs and short substitutions and indels (<100 bp); the same methods apply to evaluation of hypothesized larger, structural variations. We use an optimization process that iteratively adjusts the genome sequence to maximize its a posteriori probability given the observed reads. For each candidate sequence, this probability is computed using Bayesian statistics with a simple read generation model and simplifying assumptions that make the problem computationally tractable. The optimization process iteratively applies one-base substitutions, insertions, and deletions until convergence is achieved to an optimum diploid sequence. A local de novo assembly procedure that generalizes approaches based on De Bruijn graphs is used to seed the optimization process in order to reduce the chance of converging to local optima. Finally, a correlation-based filter is applied to reduce the false positive rate caused by the presence of repetitive regions in the reference genome.


Subject(s)
Contig Mapping/methods , Genome, Human , Sequence Analysis, DNA/methods , Algorithms , Alleles , Base Sequence , Bayes Theorem , Chromosome Mapping , Computer Simulation , Data Interpretation, Statistical , Humans , Models, Genetic
12.
Science ; 327(5961): 78-81, 2010 Jan 01.
Article in English | MEDLINE | ID: mdl-19892942

ABSTRACT

Genome sequencing of large numbers of individuals promises to advance the understanding, treatment, and prevention of human diseases, among other applications. We describe a genome sequencing platform that achieves efficient imaging and low reagent consumption with combinatorial probe anchor ligation chemistry to independently assay each base from patterned nanoarrays of self-assembling DNA nanoballs. We sequenced three human genomes with this platform, generating an average of 45- to 87-fold coverage per genome and identifying 3.2 to 4.5 million sequence variants per genome. Validation of one genome data set demonstrates a sequence accuracy of about 1 false variant per 100 kilobases. The high accuracy, affordable cost of $4400 for sequencing consumables, and scalability of this platform enable complete human genome sequencing for the detection of rare variants in large-scale genetic studies.


Subject(s)
DNA/chemistry , Genome, Human , Microarray Analysis , Sequence Analysis, DNA/methods , Base Sequence , Computational Biology , Costs and Cost Analysis , DNA/genetics , Databases, Nucleic Acid , Genomic Library , Genotype , Haplotypes , Human Genome Project , Humans , Male , Nanostructures , Nanotechnology , Nucleic Acid Amplification Techniques , Polymorphism, Single Nucleotide , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/instrumentation , Sequence Analysis, DNA/standards , Software
13.
J Chem Inf Model ; 49(8): 1901-13, 2009 Aug.
Article in English | MEDLINE | ID: mdl-19610599

ABSTRACT

A fragment-based method for computing protein-ligand binding free energies by systematic sampling has been developed. Systematic sampling of fragment-protein interactions in translational and rotational space is followed by de novo assembly of fragments into molecules and computation of binding free energies for the molecules with statistical mechanics. The rigorous sampling provides independence from the choice of initial binding pose and assembling fragments enables evaluation of binding of a large number of molecule poses with relatively little computation. The method allows a full sampling of possible conformations and avoids the "conformational focusing" problem associated with free energy methods that sample only limited conformational and orientation changes from a starting pose. The direct computation of the entropy loss upon assembling fragments into molecules is an innovation for fragment-based methods. The computed binding free energies are compared to calorimetric data for a series of ligands for the T4 lysozyme L99A mutant and binding constants for a series of p38 MAP kinase ligands. In both cases, the standard error of prediction is close to 1 kcal/mol.


Subject(s)
Bacteriophage T4/enzymology , Computer Simulation , Muramidase/metabolism , p38 Mitogen-Activated Protein Kinases/metabolism , Humans , Ligands , Models, Molecular , Muramidase/chemistry , Protein Binding , Protein Conformation , Proteins/chemistry , Proteins/metabolism , Thermodynamics , p38 Mitogen-Activated Protein Kinases/chemistry
14.
J Am Chem Soc ; 125(47): 14244-5, 2003 Nov 26.
Article in English | MEDLINE | ID: mdl-14624550

ABSTRACT

Using normal modes to generate torsion space moves in Monte Carlo simulations of peptides and proteins is not a new idea; nevertheless, despite its power it has not received widespread application. We show that such a "Modal Monte Carlo" approach is an efficient tool for ab initio predictions of small-protein structures. We apply this method to the Trp cage, a 20-residue polypeptide designed to fold rapidly into a structure that includes tertiary contacts, despite its short length. We achieve a high-quality ab initio structure prediction in about 2 orders of magnitude less computation time than state of the art molecular dynamics techniques.


Subject(s)
Models, Molecular , Monte Carlo Method , Peptides/chemistry , Recombinant Proteins/chemistry , Computer Simulation , Nuclear Magnetic Resonance, Biomolecular , Protein Folding